{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Optimization Algorithms\n",
    "\n",
    "Training a neural network consists of modifying the network’s parameters to minimize the cost function on the training set. In principle, any kind of optimization algorithm could be used. In practice, modern neural networks are almost always trained with some variant of stochastic gradient descent(SGD). Here we will provide two optimization algorithms: SGD and Adam optimizer.\n",
    "\n",
    "## Stochastic Gradient Descent\n",
    "\n",
    "The goal of an optimization algorithm is to find the value of the parameter to make loss function very low. For some types of models, an optimization algorithm might ﬁnd the global minimum value of loss function, but for neural network, the most efficient way to converge loss function to a local minimum is to minimize loss function according to each example.\n",
    "\n",
    "Gradient descent uses the following update rule to minimize loss function:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\\theta^{(t+1)} = \\theta^{(t)}-\\alpha\\nabla_\\theta L(\\theta^{(t)})$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "where t is the time step of the algorithm and $\\alpha$ is the learning rate. But this rule could be very costly when $L(\\theta)$ is defined as a sum across the entire training set. Using SGD can accelerate the learning process as we can use only a batch of examples to update the parameters. \n",
    "\n",
    "We implemented the gradient descent algorithm, which can be viewed with the following code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Using TensorFlow backend.\n"
     ]
    }
   ],
   "source": [
    "import os, sys\n",
    "sys.path = [os.path.abspath(\"../../\")] + sys.path\n",
    "from deep_learning4e import *\n",
    "from notebook4e import *"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01//EN\"\n",
       "   \"http://www.w3.org/TR/html4/strict.dtd\">\n",
       "\n",
       "<html>\n",
       "<head>\n",
       "  <title></title>\n",
       "  <meta http-equiv=\"content-type\" content=\"text/html; charset=None\">\n",
       "  <style type=\"text/css\">\n",
       "td.linenos { background-color: #f0f0f0; padding-right: 10px; }\n",
       "span.lineno { background-color: #f0f0f0; padding: 0 5px 0 5px; }\n",
       "pre { line-height: 125%; }\n",
       "body .hll { background-color: #ffffcc }\n",
       "body  { background: #f8f8f8; }\n",
       "body .c { color: #408080; font-style: italic } /* Comment */\n",
       "body .err { border: 1px solid #FF0000 } /* Error */\n",
       "body .k { color: #008000; font-weight: bold } /* Keyword */\n",
       "body .o { color: #666666 } /* Operator */\n",
       "body .ch { color: #408080; font-style: italic } /* Comment.Hashbang */\n",
       "body .cm { color: #408080; font-style: italic } /* Comment.Multiline */\n",
       "body .cp { color: #BC7A00 } /* Comment.Preproc */\n",
       "body .cpf { color: #408080; font-style: italic } /* Comment.PreprocFile */\n",
       "body .c1 { color: #408080; font-style: italic } /* Comment.Single */\n",
       "body .cs { color: #408080; font-style: italic } /* Comment.Special */\n",
       "body .gd { color: #A00000 } /* Generic.Deleted */\n",
       "body .ge { font-style: italic } /* Generic.Emph */\n",
       "body .gr { color: #FF0000 } /* Generic.Error */\n",
       "body .gh { color: #000080; font-weight: bold } /* Generic.Heading */\n",
       "body .gi { color: #00A000 } /* Generic.Inserted */\n",
       "body .go { color: #888888 } /* Generic.Output */\n",
       "body .gp { color: #000080; font-weight: bold } /* Generic.Prompt */\n",
       "body .gs { font-weight: bold } /* Generic.Strong */\n",
       "body .gu { color: #800080; font-weight: bold } /* Generic.Subheading */\n",
       "body .gt { color: #0044DD } /* Generic.Traceback */\n",
       "body .kc { color: #008000; font-weight: bold } /* Keyword.Constant */\n",
       "body .kd { color: #008000; font-weight: bold } /* Keyword.Declaration */\n",
       "body .kn { color: #008000; font-weight: bold } /* Keyword.Namespace */\n",
       "body .kp { color: #008000 } /* Keyword.Pseudo */\n",
       "body .kr { color: #008000; font-weight: bold } /* Keyword.Reserved */\n",
       "body .kt { color: #B00040 } /* Keyword.Type */\n",
       "body .m { color: #666666 } /* Literal.Number */\n",
       "body .s { color: #BA2121 } /* Literal.String */\n",
       "body .na { color: #7D9029 } /* Name.Attribute */\n",
       "body .nb { color: #008000 } /* Name.Builtin */\n",
       "body .nc { color: #0000FF; font-weight: bold } /* Name.Class */\n",
       "body .no { color: #880000 } /* Name.Constant */\n",
       "body .nd { color: #AA22FF } /* Name.Decorator */\n",
       "body .ni { color: #999999; font-weight: bold } /* Name.Entity */\n",
       "body .ne { color: #D2413A; font-weight: bold } /* Name.Exception */\n",
       "body .nf { color: #0000FF } /* Name.Function */\n",
       "body .nl { color: #A0A000 } /* Name.Label */\n",
       "body .nn { color: #0000FF; font-weight: bold } /* Name.Namespace */\n",
       "body .nt { color: #008000; font-weight: bold } /* Name.Tag */\n",
       "body .nv { color: #19177C } /* Name.Variable */\n",
       "body .ow { color: #AA22FF; font-weight: bold } /* Operator.Word */\n",
       "body .w { color: #bbbbbb } /* Text.Whitespace */\n",
       "body .mb { color: #666666 } /* Literal.Number.Bin */\n",
       "body .mf { color: #666666 } /* Literal.Number.Float */\n",
       "body .mh { color: #666666 } /* Literal.Number.Hex */\n",
       "body .mi { color: #666666 } /* Literal.Number.Integer */\n",
       "body .mo { color: #666666 } /* Literal.Number.Oct */\n",
       "body .sa { color: #BA2121 } /* Literal.String.Affix */\n",
       "body .sb { color: #BA2121 } /* Literal.String.Backtick */\n",
       "body .sc { color: #BA2121 } /* Literal.String.Char */\n",
       "body .dl { color: #BA2121 } /* Literal.String.Delimiter */\n",
       "body .sd { color: #BA2121; font-style: italic } /* Literal.String.Doc */\n",
       "body .s2 { color: #BA2121 } /* Literal.String.Double */\n",
       "body .se { color: #BB6622; font-weight: bold } /* Literal.String.Escape */\n",
       "body .sh { color: #BA2121 } /* Literal.String.Heredoc */\n",
       "body .si { color: #BB6688; font-weight: bold } /* Literal.String.Interpol */\n",
       "body .sx { color: #008000 } /* Literal.String.Other */\n",
       "body .sr { color: #BB6688 } /* Literal.String.Regex */\n",
       "body .s1 { color: #BA2121 } /* Literal.String.Single */\n",
       "body .ss { color: #19177C } /* Literal.String.Symbol */\n",
       "body .bp { color: #008000 } /* Name.Builtin.Pseudo */\n",
       "body .fm { color: #0000FF } /* Name.Function.Magic */\n",
       "body .vc { color: #19177C } /* Name.Variable.Class */\n",
       "body .vg { color: #19177C } /* Name.Variable.Global */\n",
       "body .vi { color: #19177C } /* Name.Variable.Instance */\n",
       "body .vm { color: #19177C } /* Name.Variable.Magic */\n",
       "body .il { color: #666666 } /* Literal.Number.Integer.Long */\n",
       "\n",
       "  </style>\n",
       "</head>\n",
       "<body>\n",
       "<h2></h2>\n",
       "\n",
       "<div class=\"highlight\"><pre><span></span><span class=\"k\">def</span> <span class=\"nf\">gradient_descent</span><span class=\"p\">(</span><span class=\"n\">dataset</span><span class=\"p\">,</span> <span class=\"n\">net</span><span class=\"p\">,</span> <span class=\"n\">loss</span><span class=\"p\">,</span> <span class=\"n\">epochs</span><span class=\"o\">=</span><span class=\"mi\">1000</span><span class=\"p\">,</span> <span class=\"n\">l_rate</span><span class=\"o\">=</span><span class=\"mf\">0.01</span><span class=\"p\">,</span>  <span class=\"n\">batch_size</span><span class=\"o\">=</span><span class=\"mi\">1</span><span class=\"p\">):</span>\n",
       "    <span class=\"sd\">&quot;&quot;&quot;</span>\n",
       "<span class=\"sd\">    gradient descent algorithm to update the learnable parameters of a network.</span>\n",
       "<span class=\"sd\">    :return: the updated network.</span>\n",
       "<span class=\"sd\">    &quot;&quot;&quot;</span>\n",
       "    <span class=\"c1\"># init data</span>\n",
       "    <span class=\"n\">examples</span> <span class=\"o\">=</span> <span class=\"n\">dataset</span><span class=\"o\">.</span><span class=\"n\">examples</span>\n",
       "\n",
       "    <span class=\"k\">for</span> <span class=\"n\">e</span> <span class=\"ow\">in</span> <span class=\"nb\">range</span><span class=\"p\">(</span><span class=\"n\">epochs</span><span class=\"p\">):</span>\n",
       "        <span class=\"n\">total_loss</span> <span class=\"o\">=</span> <span class=\"mi\">0</span>\n",
       "        <span class=\"n\">random</span><span class=\"o\">.</span><span class=\"n\">shuffle</span><span class=\"p\">(</span><span class=\"n\">examples</span><span class=\"p\">)</span>\n",
       "        <span class=\"n\">weights</span> <span class=\"o\">=</span> <span class=\"p\">[[</span><span class=\"n\">node</span><span class=\"o\">.</span><span class=\"n\">weights</span> <span class=\"k\">for</span> <span class=\"n\">node</span> <span class=\"ow\">in</span> <span class=\"n\">layer</span><span class=\"o\">.</span><span class=\"n\">nodes</span><span class=\"p\">]</span> <span class=\"k\">for</span> <span class=\"n\">layer</span> <span class=\"ow\">in</span> <span class=\"n\">net</span><span class=\"p\">]</span>\n",
       "\n",
       "        <span class=\"k\">for</span> <span class=\"n\">batch</span> <span class=\"ow\">in</span> <span class=\"n\">get_batch</span><span class=\"p\">(</span><span class=\"n\">examples</span><span class=\"p\">,</span> <span class=\"n\">batch_size</span><span class=\"p\">):</span>\n",
       "\n",
       "            <span class=\"n\">inputs</span><span class=\"p\">,</span> <span class=\"n\">targets</span> <span class=\"o\">=</span> <span class=\"n\">init_examples</span><span class=\"p\">(</span><span class=\"n\">batch</span><span class=\"p\">,</span> <span class=\"n\">dataset</span><span class=\"o\">.</span><span class=\"n\">inputs</span><span class=\"p\">,</span> <span class=\"n\">dataset</span><span class=\"o\">.</span><span class=\"n\">target</span><span class=\"p\">,</span> <span class=\"nb\">len</span><span class=\"p\">(</span><span class=\"n\">net</span><span class=\"p\">[</span><span class=\"o\">-</span><span class=\"mi\">1</span><span class=\"p\">]</span><span class=\"o\">.</span><span class=\"n\">nodes</span><span class=\"p\">))</span>\n",
       "            <span class=\"c1\"># compute gradients of weights</span>\n",
       "            <span class=\"n\">gs</span><span class=\"p\">,</span> <span class=\"n\">batch_loss</span> <span class=\"o\">=</span> <span class=\"n\">BackPropagation</span><span class=\"p\">(</span><span class=\"n\">inputs</span><span class=\"p\">,</span> <span class=\"n\">targets</span><span class=\"p\">,</span> <span class=\"n\">weights</span><span class=\"p\">,</span> <span class=\"n\">net</span><span class=\"p\">,</span> <span class=\"n\">loss</span><span class=\"p\">)</span>\n",
       "            <span class=\"c1\"># update weights with gradient descent</span>\n",
       "            <span class=\"n\">weights</span> <span class=\"o\">=</span> <span class=\"n\">vector_add</span><span class=\"p\">(</span><span class=\"n\">weights</span><span class=\"p\">,</span> <span class=\"n\">scalar_vector_product</span><span class=\"p\">(</span><span class=\"o\">-</span><span class=\"n\">l_rate</span><span class=\"p\">,</span> <span class=\"n\">gs</span><span class=\"p\">))</span>\n",
       "            <span class=\"n\">total_loss</span> <span class=\"o\">+=</span> <span class=\"n\">batch_loss</span>\n",
       "            <span class=\"c1\"># update the weights of network each batch</span>\n",
       "            <span class=\"k\">for</span> <span class=\"n\">i</span> <span class=\"ow\">in</span> <span class=\"nb\">range</span><span class=\"p\">(</span><span class=\"nb\">len</span><span class=\"p\">(</span><span class=\"n\">net</span><span class=\"p\">)):</span>\n",
       "                <span class=\"k\">if</span> <span class=\"n\">weights</span><span class=\"p\">[</span><span class=\"n\">i</span><span class=\"p\">]:</span>\n",
       "                    <span class=\"k\">for</span> <span class=\"n\">j</span> <span class=\"ow\">in</span> <span class=\"nb\">range</span><span class=\"p\">(</span><span class=\"nb\">len</span><span class=\"p\">(</span><span class=\"n\">weights</span><span class=\"p\">[</span><span class=\"n\">i</span><span class=\"p\">])):</span>\n",
       "                        <span class=\"n\">net</span><span class=\"p\">[</span><span class=\"n\">i</span><span class=\"p\">]</span><span class=\"o\">.</span><span class=\"n\">nodes</span><span class=\"p\">[</span><span class=\"n\">j</span><span class=\"p\">]</span><span class=\"o\">.</span><span class=\"n\">weights</span> <span class=\"o\">=</span> <span class=\"n\">weights</span><span class=\"p\">[</span><span class=\"n\">i</span><span class=\"p\">][</span><span class=\"n\">j</span><span class=\"p\">]</span>\n",
       "\n",
       "        <span class=\"k\">if</span> <span class=\"p\">(</span><span class=\"n\">e</span><span class=\"o\">+</span><span class=\"mi\">1</span><span class=\"p\">)</span> <span class=\"o\">%</span> <span class=\"mi\">10</span> <span class=\"o\">==</span> <span class=\"mi\">0</span><span class=\"p\">:</span>\n",
       "            <span class=\"k\">print</span><span class=\"p\">(</span><span class=\"s2\">&quot;epoch:{}, total_loss:{}&quot;</span><span class=\"o\">.</span><span class=\"n\">format</span><span class=\"p\">(</span><span class=\"n\">e</span><span class=\"o\">+</span><span class=\"mi\">1</span><span class=\"p\">,</span><span class=\"n\">total_loss</span><span class=\"p\">))</span>\n",
       "    <span class=\"k\">return</span> <span class=\"n\">net</span>\n",
       "</pre></div>\n",
       "</body>\n",
       "</html>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "psource(gradient_descent)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There several key elements need to specify when using a `gradient_descent` optimizer:\n",
    "\n",
    "- **dataset**: A dataset object we used in the previous chapter, such as `iris` and `orings`.\n",
    "- **net**: A neural network object which we will cover in the next chapter.\n",
    "- **loss**: The loss function used in representing accuracy.\n",
    "- **epochs**: How many rounds the training set is used.\n",
    "- **l_rate**: learning rate.\n",
    "- **batch_size**: The number of examples is used in each update. When very small batch size is used, gradient descent and be treated as SGD."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Adam Optimizer\n",
    "\n",
    "To mitigate some of the problems caused by the fact that the gradient ignores the second derivatives, some optimization algorithms incorporate the idea of momentum which keeps a running average of the gradients of past mini-batches. Thus Adam optimizer maintains a table saving the previous gradient result.\n",
    "\n",
    "To view the pseudocode and the implementation, you can use the following codes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pseudocode(adam_optimizer)\n",
    "psource(adam_optimizer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are several attributes to specify when using Adam optimizer that is different from gradient descent: rho and delta. These parameters determine the percentage of the last iteration is memorized. For more details of how this algorithm work, please refer to the article [here](https://arxiv.org/abs/1412.6980).\n",
    "\n",
    "In the Stanford course on deep learning for computer vision, the Adam algorithm is suggested as the default optimization method for deep learning applications: \n",
    ">In practice Adam is currently recommended as the default algorithm to use, and often works slightly better than RMSProp. However, it is often also worth trying SGD+Nesterov Momentum as an alternative."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Backpropagation\n",
    "\n",
    "The above algorithms are optimization algorithms: they update parameters like $\\theta$ to get smaller loss values. And back-propagation is the method to calculate the gradient for each layer. For complicated models like deep neural networks, the gradients can not be calculated directly as there are enormous array-valued variables.\n",
    "\n",
    "Fortunately, back-propagation can calculate the gradients briefly which we can interpret as calculating gradients from the last layer to the first which is the inverse process to the forwarding procedure. The derivation of the loss function is passed to previous layers to make them changing toward the direction of minimizing the loss function."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"images/backprop.png\" width=\"500\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Applying optimizers and back-propagation algorithm together, we can update the weights of a neural network to minimize the loss function with alternatively doing forward and back-propagation process. Here is a figure form [here](https://medium.com/datathings/neural-networks-and-backpropagation-explained-in-a-simple-way-f540a3611f5e) describing how a neural network updates its weights:\n",
    "\n",
    "<img src=\"images/nn_steps.png\" width=\"700\"></img>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In our implementation, all the steps are integrated into the optimizer objects. The forward-backward process of passing information through the whole neural network is put into the method `BackPropagation`. You can view the code with:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "psource(BackPropagation)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The demonstration of optimizers and back-propagation algorithm will be made together with neural network learners."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
